Disease is an unavoidable part of existence, affecting not only people but all living things. Everyone will encounter one or more diseases in their lifetime, whether inherited or caused by external factors. To fight the many diseases caused by external agents that enter the body, we rely on built-in immune responses and on medical therapies.
Humanity has faced countless diseases throughout history, and as our ways of life change, diseases evolve and take on new forms. The recent Covid-19 outbreak exemplifies the deadly impact a disease can have on a worldwide scale. Pandemics, however, are not a new phenomenon; throughout history, the world has been shaken by many illnesses that spread widely and reduced human life expectancy.
This analysis looks at a large dataset with information on 50 diseases reported across the United States between 1888 and 2014. The information comes from Project Tycho, which works with researchers and national and international health institutes to provide open, free data for public use in analysis and research.
When dealing with such a large volume of data, visualization becomes a vital tool. Graphical representations allow for more meaningful comprehension and insight than combing through massive Excel spreadsheets.
The data from Tycho 2.0 can be analyzed to uncover patterns that are not obvious from the raw records. It offers many possibilities for research on the 50 diseases that affected the United States, but the scope must be restricted to certain locations in order to produce a high-quality analysis.
The study’s goal is to pinpoint the most common illness that afflicted Americans from 1888 to 2014 and to ascertain which states were most impacted. The study attempts to identify the illness type with the greatest prevalence rate nationwide and investigate its effects on various states by evaluating the data.
The purpose is to explore how infections spread among the 50 states in the United States. The first step is deciphering and comprehending data formats, columns, and the data itself. This information will be useful for performing statistical analysis, making visualizations, and deriving conclusions about the relationships between events and their effects. One may gain a thorough grasp of how diseases spread by looking at these connections.
The statistical language R offers simple and quick tools for turning data into visually appealing components such as graphs, which make the data easier to read and understand. The following types of graphs are plotted here using descriptive statistical methods and ggplot2.
Map plot: Displays the data on a map of the US using the plotly, tmap, and mapview packages.
Bar plot with points (geom_bar with geom_point): Displays the data as bars, with a point marking each mean value.
Scatter plot: Represents values for two numeric variables. The position of each dot on the horizontal and vertical axes gives the values for an individual data point. Scatter plots are used here to observe the relationship between year and cases or deaths.
Grouped bar plot: Represents categorical data with rectangular bars whose heights or lengths are proportional to the values they represent, placed side by side to compare two or more variables.
Heat map: A two-dimensional representation of data in which values are represented by color gradients.
Box plot with animation: A box plot shows the distribution of continuous data through five summary statistics. Animating the plot makes the comparison easier to follow.
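As a small illustration of the five summary statistics mentioned for the box plot, the sketch below computes them in base R. The counts are made-up values, not taken from the Tycho data, and the variable names are only for illustration.

```r
# Toy weekly counts (illustrative values, not taken from the Tycho data)
counts <- c(1, 3, 4, 5, 14, 20, 35)

# The five statistics a box plot is built on: minimum, lower quartile,
# median, upper quartile, maximum. geom_boxplot() additionally limits the
# whiskers to 1.5 * IQR from the box and draws points beyond them as outliers.
five_stats <- quantile(counts, probs = c(0, 0.25, 0.5, 0.75, 1), names = FALSE)
print(five_stats)
```

Because the counts in this report span several orders of magnitude, the box plots later on are drawn on a log10 axis.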
Below is a list of libraries that were utilized in this analysis.
library (ggplot2)
library (dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library (tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library (tmap); library (mapview) # library() attaches one package per call, so each package needs its own call
## The legacy packages maptools, rgdal, and rgeos, underpinning this package
## will retire shortly. Please refer to R-spatial evolution reports on
## https://r-spatial.org/r/2023/05/15/evolution4.html for details.
## This package is now running under evolution status 0
library (plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library (viridis)
## Loading required package: viridisLite
library (ggridges)
library (readr)
library (usmap)
library (gapminder)
library (ggthemes)
library (gganimate)
library (png)
library (gifski)
options(scipen=999999) # disable scientific notation
The code below sets the working directory for the project and then loads the data to be analyzed into a data frame.
setwd("K:/AI & DS/Data Visualisation/Tycho2") # set the working directory for the project
US <- read.csv("ProjectTycho_Level2_v1.1.0.csv", header = TRUE, stringsAsFactors = TRUE) # read.csv() loads the data into the workspace
The next step is to inspect the data, checking its size and contents. Once the data has been sorted and cleaned, it is ready for analysis.
dim(US) # It shows dimension of our data frame.
## [1] 3659360 11
US <- US %>% arrange(epi_week) # sort chronologically by epidemiological week
tail(US) # show the last rows of the data
US <- US[ , -2] # drop the unwanted 2nd column
US <- US[ , -10] # drop the unwanted last column (the 10th of the remaining 10)
str(US) # inspect the data type of each column
## 'data.frame': 3659360 obs. of 9 variables:
## $ epi_week : int 188737 188823 188823 188823 188824 188824 188824 188824 188824 188824 ...
## $ state : Factor w/ 57 levels "AK","AL","AR",..: 20 39 39 39 42 42 42 39 39 42 ...
## $ loc : Factor w/ 630 levels "ABERDEEN","ADAMS",..: 325 123 123 123 465 465 465 123 123 465 ...
## $ loc_type : Factor w/ 2 levels "CITY","STATE": 1 1 1 1 1 1 1 1 1 1 ...
## $ disease : Factor w/ 50 levels "ANTHRAX","BABESIOSIS",..: 46 46 36 11 46 36 11 46 11 38 ...
## $ event : Factor w/ 2 levels "CASES","DEATHS": 2 2 2 2 2 2 2 2 2 2 ...
## $ number : int 3 1 1 5 14 4 4 3 5 3 ...
## $ from_date: Factor w/ 10557 levels "1887-09-09","1888-06-03",..: 1 2 2 2 3 3 3 3 3 3 ...
## $ to_date : Factor w/ 10557 levels "1887-09-15","1888-06-09",..: 1 2 2 2 3 3 3 3 3 3 ...
US <- US |> filter(!is.na(number)) # drop rows with a missing count
US <- subset(US, !(state %in% c ("AS", "GU", "MP", "PR", "PT", "VI"))) # drop territory codes from the state column (their factor levels remain, with 0 entries)
US$to_date <- as.Date(US$to_date, format = "%Y-%m-%d") # convert the date columns from factor to Date
US$from_date <- as.Date(US$from_date, format = "%Y-%m-%d")
str(US) # check the data type of each column again
## 'data.frame': 3650344 obs. of 9 variables:
## $ epi_week : int 188737 188823 188823 188823 188824 188824 188824 188824 188824 188824 ...
## $ state : Factor w/ 57 levels "AK","AL","AR",..: 20 39 39 39 42 42 42 39 39 42 ...
## $ loc : Factor w/ 630 levels "ABERDEEN","ADAMS",..: 325 123 123 123 465 465 465 123 123 465 ...
## $ loc_type : Factor w/ 2 levels "CITY","STATE": 1 1 1 1 1 1 1 1 1 1 ...
## $ disease : Factor w/ 50 levels "ANTHRAX","BABESIOSIS",..: 46 46 36 11 46 36 11 46 11 38 ...
## $ event : Factor w/ 2 levels "CASES","DEATHS": 2 2 2 2 2 2 2 2 2 2 ...
## $ number : int 3 1 1 5 14 4 4 3 5 3 ...
## $ from_date: Date, format: "1887-09-09" "1888-06-03" ...
## $ to_date : Date, format: "1887-09-15" "1888-06-09" ...
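The factor-to-Date conversion above can be checked on a small scale. The sketch below uses the first two date pairs shown in the str() output; the variable names are hypothetical. Once the columns are of class Date, the length of each reporting period can be computed directly.

```r
# Dates stored as factors, as read.csv(stringsAsFactors = TRUE) produces them
from_date <- factor(c("1887-09-09", "1888-06-03"))
to_date   <- factor(c("1887-09-15", "1888-06-09"))

# Convert factor -> Date using the same format string as in the analysis
from_d <- as.Date(from_date, format = "%Y-%m-%d")
to_d   <- as.Date(to_date, format = "%Y-%m-%d")

# Length of each reporting period in days, counting both end dates
period_days <- as.numeric(to_d - from_d) + 1
print(period_days)
```

Both periods span seven days, which is consistent with weekly reporting.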
sort(summary(US$state)) # Total entries per each state
## AS GU MP PR PT VI AK HI WY MS NV
## 0 0 0 0 0 0 5272 8436 11875 15383 18084
## SD AZ ID NM DE ND ME UT VT DC OR
## 26406 27839 28626 29091 30749 36092 36830 37563 38571 40058 41847
## OK NE AR NH RI IA SC WV KY CO FL
## 42851 44666 45668 47375 52831 58812 59588 59984 61325 62440 62674
## KS LA NC MT AL WA GA TN MD MN CT
## 64734 66134 71351 72488 74015 74067 75527 78327 80560 81114 87596
## MO VA MI IL WI IN TX NJ CA OH PA
## 88793 95074 110275 111040 111533 117738 120102 137949 139075 168281 175361
## NY MA
## 186258 232016
sort(summary(US$disease)) # Total Entry on each disease.
## ENCEPHALITIS BOTULISM
## 5 10
## VARIOLOID BABESIOSIS
## 21 36
## YELLOW FEVER CHOLERA
## 75 125
## EHRLICHIOSIS/ANAPLASMOSIS DENGUE
## 136 621
## TRICHINIASIS PSITTACOSIS
## 737 824
## COCCIDIOIDOMYCOSIS TOXIC SHOCK SYNDROME
## 1066 1256
## TETANUS STREPTOCOCCAL DISEASE, INVASIVE GROUP A
## 1732 3306
## STREPTOCOCCAL SORE THROAT LYME DISEASE
## 4587 4706
## LEGIONELLOSIS ANTHRAX
## 6686 7051
## CRYPTOSPORIDIOSIS LEPROSY
## 7589 7677
## SHIGELLOSIS GIARDIASIS
## 8651 10142
## DYSENTERY TYPHUS FEVER
## 10493 11865
## MALARIA SALMONELLOSIS
## 12216 12737
## BRUCELLOSIS [UNDULANT FEVER] TULAREMIA
## 14957 15984
## CHLAMYDIA ROCKY MOUNTAIN SPOTTED FEVER
## 16524 18655
## GONORRHEA RUBELLA
## 23268 25843
## MENINGITIS RABIES IN ANIMALS
## 29759 30304
## PELLAGRA HEPATITIS B
## 33244 49667
## HEPATITIS A CHICKENPOX [VARICELLA]
## 50526 74434
## MUMPS POLIOMYELITIS
## 87771 140606
## PNEUMONIA WHOOPING COUGH [PERTUSSIS]
## 213518 229798
## INFLUENZA PNEUMONIA AND INFLUENZA
## 236673 239793
## SMALLPOX TYPHOID FEVER [ENTERIC FEVER]
## 274862 304320
## SCARLET FEVER TUBERCULOSIS [PHTHISIS PULMONALIS]
## 344805 346515
## MEASLES DIPHTHERIA
## 354973 379195
Events <- US %>% filter(number >= 1) # keep records with at least one case or death
Events$epi_week <- substr(Events$epi_week,1,4) # keep the first four digits (the year) of the year+week code
colnames(Events)[1] <- 'year' # rename epi_week to year
Events$year <- as.numeric(Events$year)
Event_Case <- subset(Events, event == "CASES", select = c(year, state, disease, number)) # cases only
Event_Death <- subset(Events, event == "DEATHS", select = c(year, state, disease, number )) # deaths only
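The year extraction can be verified on a few epi_week codes. The sketch below compares the substr() approach used in this analysis with an arithmetic alternative; the first two codes appear in the str() output, while the third code and all variable names are made up for illustration.

```r
# epi_week codes combine a four-digit year and a two-digit week number
epi_week <- c(188737L, 188823L, 201452L)

# Approach used in this analysis: take the first four characters
year_substr <- as.numeric(substr(epi_week, 1, 4))

# Arithmetic alternative: integer division and remainder by 100
year_intdiv <- epi_week %/% 100L
week        <- epi_week %% 100L

print(data.frame(epi_week, year_substr, year_intdiv, week))
```

Both approaches give the same years; the arithmetic version also recovers the week number if a finer time axis is ever needed.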
The first plot, an interactive choropleth map, displays all cases for each state in the United States, year by year, from 1888 to 2014. It shows that particular states, such as New York in the Northeast, take on lighter hues more frequently. Texas in the South and California on the West Coast also stand out. States in the geographic center of the country, on the other hand, keep dark hues, signifying fewer cases between 1888 and 2014.
# Preparing Data for Graph
year_state_Case <- Event_Case %>% select(year, state, number)
year_state_Case <- aggregate(year_state_Case, number ~ year + state, sum)
# Plotting a Graph according filtered Data
plot_geo(year_state_Case, locationmode = 'USA-states', frame = ~ year) %>%
add_trace(locations = ~ state, z = ~number, zmin = 0, zmax = max(year_state_Case$number), color = ~number, colorscale = 'PuBu') %>%
layout(geo = list(scope = 'usa'), title = "Choropleth map for Cases with all disease by each year")
The map above shows cases only. To check whether the burden is spread evenly across states, it is useful to compare it with deaths from all diseases combined for each state. Comparing cases and deaths by state generally supports the idea that a spike in cases leads to a rise in deaths; however, some states (WI, WA) do not follow this pattern, which suggests that diseases with high fatality are unevenly distributed. The plot below shows the other event type, deaths, for each state.
# Preparing Data for Graph
year_state_Death <- Event_Death %>% select(year, state, number)
year_state_Death <- aggregate(year_state_Death, number ~ year + state, sum)
# Plotting a Graph according filtered Data
plot_geo(year_state_Death, locationmode = 'USA-states', frame = ~ year) %>%
add_trace(locations = ~ state, z = ~number, zmin = 0, zmax = max(year_state_Death$number), color = ~number, colorscale = 'Electric') %>%
layout(geo = list(scope = 'usa'), title = "Choropleth map for Deaths with all disease by each year ")
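The aggregation step that prepares the data for both maps can be illustrated on a toy data frame. This is a minimal sketch with made-up numbers; it uses the same formula, number ~ year + state, to sum the counts for every year and state combination.

```r
# Toy records: several counts per year and state (made-up values)
toy <- data.frame(
  year   = c(1900, 1900, 1900, 1901),
  state  = c("NY", "NY", "TX", "NY"),
  number = c(3, 5, 2, 4)
)

# One row per year/state combination, with the counts summed
totals <- aggregate(number ~ year + state, data = toy, FUN = sum)
print(totals)
```

The two NY records for 1900 collapse into a single row with a total of 8, which is the value the choropleth would color for that state and year.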
This plot combines two layers, bars and points. It displays the mean number of cases for each disease. The results show that chlamydia has the highest mean value and yellow fever the lowest.
# Summary of mean value to all Disease
Mean_Dise <- Event_Case %>% group_by(disease) %>% summarise(number = mean(number)) # summarizing mean value
ggplot(Mean_Dise, aes(number, disease)) + # assigning axis values
geom_bar(stat = "identity", fill = 'gray35', alpha = 0.8, width = 0.6) + geom_point(color = "gray0", size = 3) + # bars with a point at each mean
theme_minimal() + # base theme first, so the adjustment below is not overwritten
theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
labs(title ="Mean bar graph on Cases", subtitle = 'log10 scale on x axis' , x = 'Number of Cases') +
scale_x_log10(breaks = c(10,20,40,80,160,320)) # log10 scale and breaks for the x axis
The scatter plot compares cases over the observation period, using a log10 scale to compare cases of all diseases across the states. According to the plot, there was an excess of cases in every state between 1930 and 1960. Data is missing for the four years from 2002 to 2005. Cases rise again after that, around 2010.
ggplot(data = year_state_Case) +
geom_jitter(aes(x =year, y = number, color = state), alpha=.4, size = 1) +
labs(title ='Comparing Cases according states by year', subtitle = 'log10 on Y axis' , y = 'Number of Cases') +
scale_y_log10() +
theme_minimal() +
theme(legend.position = "none") # hide the legend, which would otherwise list every state
This scatter plot compares deaths over the observation period across the states, again on a log10 scale. It shows that the record contains no death information from 1948 until around 1965, so the data record provided is not entirely complete.
ggplot(data = year_state_Death) +
geom_jitter(aes(x =year, y = number, color = state), alpha=.4, size = 1) +
labs(title ='Comparing Death according states by year',
subtitle = 'log10 on Y axis' , y = 'Number of Deaths') +
scale_y_log10() +
theme_minimal() +
theme(legend.position = "none") # hide the legend, which would otherwise list every state
This bar graph shows the diseases with fewer than 50,000 documented cases, along with the cases and deaths recorded for each. The analysis found 25 such diseases. Because many of them have fewer recorded cases than reported deaths, the data is not entirely reliable. Most strikingly, roughly 40,000 deaths are documented for influenza and pneumonia combined, although 0 cases were reported.
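The gaps read off the two scatter plots can also be checked numerically. The sketch below defines a hypothetical helper, missing_years(), that lists the calendar years inside the observed range for which no record exists; the input years here are made up.

```r
# Hypothetical helper: years within the observed range that have no records
missing_years <- function(years) {
  setdiff(seq(min(years), max(years)), unique(years))
}

# Toy example with a gap between 2001 and 2006 (made-up years)
toy_years <- c(1999, 2000, 2001, 2006, 2007)
gaps <- missing_years(toy_years)
print(gaps)
```

Assuming the data frames built above, missing_years(year_state_Case$year) and missing_years(year_state_Death$year) would list the gaps for cases and deaths respectively.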
Dise <- Events %>%
count(disease,event, wt = number) %>%
pivot_wider(names_from = event, values_from = n, values_fill = 0)
Dise50_Plus <- Dise[Dise$CASES >= 50000, c("disease","CASES","DEATHS")] %>%
pivot_longer(cols = DEATHS : CASES, names_to = "event", values_to = "number")
Dise50 <- Dise[Dise$CASES < 50000, c("disease","CASES","DEATHS")] %>%
pivot_longer(cols = DEATHS : CASES, names_to = "event", values_to = "number")
ggplot(Dise50, aes(x = reorder(disease, number) , y = number, fill= event))+ # diseases ordered by count on x, counts on y
geom_bar(position='dodge', stat='identity') + # Dodge plot
scale_fill_manual(values = c("#CCCC00", "#666600")) + # color
labs(title ="Up to 50,000 cases and deaths reported per Disease", x = 'Disease' , y = 'Numbers') + # Title of the plot
theme(text = element_text(size = 8),axis.text.x = element_text(angle = 30, hjust = 1),legend.position = c(0.1, 0.8)) # Theme with adjustment
This grouped bar graph shows the diseases with 50,000 or more reported cases, comparing the cases and deaths recorded for each. The analysis found 24 such diseases. The records show 277655 reported cases of measles, the highest of any disease. According to the data, there are also around 70,000 cases of pneumonia and 73,000 deaths from it, so more deaths than cases have been reported. This information may therefore not be fully accurate for some diseases.
ggplot(Dise50_Plus, aes(x = reorder(disease, number) , y = number, fill= event))+ # diseases ordered by count on x, counts on y
geom_bar(position='dodge', stat='identity') + # Dodge plot
scale_fill_manual(values = c("#3366CC", "#333399")) + # color
labs(title ="More than 50,000 cases and deaths reported per Disease", x = 'Disease' , y = 'Numbers') + # Title of the plot
theme(text = element_text(size = 8),axis.text.x = element_text(angle = 30, hjust = 1),legend.position = c(0.1, 0.8)) # Theme with adjustment
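The observation that some diseases have more recorded deaths than cases can be checked directly rather than read off the bars. The sketch below is a base R illustration on made-up totals with placeholder disease names; it cross-tabulates the counts by disease and event and flags every disease whose deaths exceed its cases.

```r
# Toy records with placeholder disease names and made-up counts
toy <- data.frame(
  disease = c("A", "A", "B", "B", "C"),
  event   = c("CASES", "DEATHS", "CASES", "DEATHS", "DEATHS"),
  number  = c(100, 10, 70, 73, 40)
)

# Sum of counts per disease (rows) and event (columns); missing pairs become 0
totals <- xtabs(number ~ disease + event, data = toy)

# Diseases with more recorded deaths than cases
flagged <- rownames(totals)[totals[, "DEATHS"] > totals[, "CASES"]]
print(flagged)
```

Disease C, which has deaths but no cases at all, is flagged as well, which mirrors the situation described for the real data.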
The heat map shows how strongly each state was affected by all diseases combined in each year. Almost all states were affected; compared with the first scatter plot, it gives a more detailed view of the impact on each state in a particular year. A darker hue indicates a larger number of cases.
year_state_Case$Create <- cut(year_state_Case$number, breaks = c(0,100,1000,10000,100000,1000000,5000000))
# Heatmap
ggplot(year_state_Case, aes(x=year, y = state))+ # assigning axis
geom_tile(aes(fill= Create)) + # fill each tile by its binned case count
labs(title ='Heatmap of Cases on states per year', y = 'States') +
scale_fill_manual(name = "Numbers", values = c("#FF9966","#FF3333","#CC0000","#990000","#660000", "#330000"),
labels = c("<= 100", "<= 1,000", "<= 10,000", "<= 100,000", "<= 1,000,000") ) + # colors, with the upper bound of each bin as label
theme(text = element_text(size = 8)) # theme of the plot
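The binning behind the heat map colors can be tried out on a few values. This is a minimal sketch with made-up totals; the label strings are an assumption chosen for readability. By default cut() builds intervals that are open on the left and closed on the right, so a total of exactly 100 still falls into the first bin.

```r
# Made-up yearly totals for illustration
totals <- c(50, 100, 101, 2500, 750000)

# Bins (0,100], (100,1000], ..., (100000,1000000]
bins <- cut(totals,
            breaks = c(0, 100, 1000, 10000, 100000, 1000000),
            labels = c("<= 100", "<= 1,000", "<= 10,000", "<= 100,000", "<= 1,000,000"))
print(table(bins))
```

A total of 0 would fall outside the first interval and become NA; that cannot happen here because only records with at least one case or death are kept.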
Events$number <- as.integer(Events$number) # keep the counts as integers (is.integer() would only return TRUE or FALSE, not the counts)
Both_Event <- Events %>%
select( state, disease, event, number) %>%
pivot_wider(names_from = event, values_from = number , values_fn = {sum}) %>% # Transferring rows to cols
group_by(CASES, DEATHS) %>%
filter(CASES > 0 & DEATHS > 0 ) %>% # Removing 0 values
pivot_longer( cols = DEATHS : CASES, names_to = "event", values_to = "number") # Again Transferring cols to rows
The box plot below compares cases and deaths for 14 diseases. To identify the deadly diseases, cases are compared with deaths. The box plot shows that CHICKENPOX [VARICELLA] and BRUCELLOSIS [UNDULANT FEVER] recorded far more cases than deaths, so these diseases are not especially deadly, whereas PNEUMONIA and TUBERCULOSIS [PHTHISIS PULMONALIS] have more deaths recorded than cases, as the quartile values of the boxes show.
Box <- ggplot(Both_Event, aes(x = disease, y = number, fill = event)) +
geom_boxplot( width = 0.6) + scale_y_log10() +
labs(x = "Disease", y = "Number") + scale_fill_manual(values= c("#666600", "#990000")) +
ggtitle("Comparison of Deaths and Cases by Disease") +
theme(text = element_text(size = 10),axis.text.x = element_text(angle = 15, hjust = 1),legend.position = c(0.95, 0.9))
Box.animation = Box + # box plot with Animation
transition_states(disease, wrap = FALSE) +
shadow_mark(alpha = 0.7) +
enter_grow() +
exit_fade() +
ease_aes('back-out')
Box.animation # To see animation plot
This exploratory analysis raises several points. First of all, the data is not accurate or reliable enough for a conclusive analysis: the state column contains more than 50 codes, and some diseases have only case counts while others have only deaths, which raises the question of how deaths could be recorded without any cases.
New York and Texas are more heavily affected by cases than other states. In terms of mortality, however, Texas is less affected than New York.
In terms of disease, measles has the highest number of cases, approximately 25309922. No deaths are recorded for measles, so no hypothesis can be formed about its impact on mortality.
In addition, several years are not recorded in Project Tycho, which is another major factor that inhibits further analysis.
The information originally provided by US health authorities was flawed in practice. It comprised state and city names and dates, and the numbers were further separated into events that distinguish cases from deaths, which made it possible to extract useful insights. The exact weekly time spans of the data allowed for analysis based on a series of occurrences.
At first sight the data shows no missing numbers; however, a closer look reveals that it lacks death counts for a number of diseases, to mention just a few: chlamydia, gonorrhea, measles, mumps, and rubella. Conversely, for some diseases the data includes only deaths, for example varioloid, cholera, and influenza and pneumonia.
When only cases were considered, the results showed that measles was the most common disease; had data on deaths been available, the analysis could have given a more accurate picture of mortality due to measles.
This graphical study has shown that several diseases, including meningitis, tuberculosis, pneumonia, and pellagra, have high mortality in the United States, while the data shows that measles is the most prevalent disease among the 50.